Analytics for Noisy Unstructured Text A Data II
نویسندگان
چکیده
The importance of text mining applications is growing proportionally with the exponential growth of electronic text. Along with the growth of internet many other sources of electronic text have become really popular. With increasing penetration of internet, many forms of communication and interaction such as email, chat, newsgroups, blogs, discussion groups, scraps etc. have become increasingly popular. These generate huge amount of noisy text data everyday. Apart from these the other big contributors in the pool of electronic text documents are call centres and customer relationship management organizations in the form of call logs, call transcriptions, problem tickets, complaint emails etc., electronic text generated by Optical Character Recognition (OCR) process from hand written and printed documents and mobile text such as Short Message Service (SMS). Though the nature of each of these documents is different but there is a common thread between all of these—presence of noise. An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: “Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp.” Opinion(product1,good), from a blog post such as: “I absolutely liked the texture of SheetK quilts.” At superficial level, there are two ways for information extraction from noisy text. The first one is cleaning text by removing noise and then applying existing state of the art techniques for information extraction. There in lies the importance of techniques for automatically correcting noisy text. In this chapter, first we will review some work in the area of noisy text correction. The second approach is to devise extraction techniques which are robust with respect to noise. Later in this chapter, we will see how the task of information extraction is affected by noise.
منابع مشابه
How Much Noise in Text is too Much: A Study in Automatic Document Classification
Noise is a stark reality in real life data. Especially in the domain of text analytics it has a significant impact as data cleaning forms a very large part (upto 80% time) of the data processing cycle. Noisy unstructured text is common in informal settings such as on-line chat, SMS, email, newsgroups and blogs, automatically transcribed text from speech data, and automatically recognized text f...
متن کاملUsing Text Analytics to Derive Customer Service Management Benefits from Unstructured Data
The Growth of Text Analytics1 Estimates suggest that about 80% of today’s enterprise data is unstructured.2 Unlike structured data, which is tidy and mostly numeric, unstructured data is often textual and, therefore, messy. Unstructured data comprises documents, emails, instant messages or user posts and comments on social media, and presents a challenge to data miners; analyzing unstructured d...
متن کاملText Analytics to Data Warehousing
─ Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in Text Mining and technologies to deal with unstructured and semi structured data like X...
متن کاملIn-depth Interactive Visual Exploration for Bridging Unstructured and Structured Document Content
Semi-structured data refers to the combination of unstructured and structured data. Unstructured data is free text in natural language, while structured data is typically stored in tables and following a data schema. Recent statistics shows that 80% of the data generated in the last two years is unstructured. However, one interesting observation is that free text usually comes along with some s...
متن کاملData Management and Big Data Text Analytics
-------------------------------------------------------------------ABSTRACT------------------------------------------------------------Big data is now one of the most important technology trends that have the potential for changing the way organizations transform massive amounts of data into knowledge. It is a combination of data-management technologies that have evolved over time. It enables o...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009